Abstract:

We study the statistical properties of hydrophobic/polar model sequences with unique 
native states on the square lattice. It is shown that this ensemble of sequences dif- 
fers from random sequences in significant ways in terms of both the distribution of 
hydrophobicity along the chains and total hydrophobicity. Whenever statistically 
feasible, the analogous calculations are performed for a set of real enzymes, too. 
1 Introduction



Functional protein sequences exhibit the ability to fold spontaneously into a unique
native state (Creighton, 1993). A natural step toward understanding this crucial
property is to compare good and bad folding sequences in simple models where
conformational space can be properly explored. Most such studies have been directed
toward identifying physical characteristics of good folders, and in this important area
some progress has been made (Sali et al., 1994; Bryngelson et al., 1995; Klimov
and Thirumalai, 1998; Nymeyer et al., 1998). In this paper we address the question
of how good folders differ from random sequences in purely statistical terms.
A related but different topic is how sequences that share the same (unique) native
state are distributed in sequence space. This question and its evolutionary implications
have recently attracted considerable attention (Li et al., 1996; Bornberg-Bauer,
1997; Govindarajan and Goldstein, 1997a,b; Bastolla et al., 1999; Broglia et al., 1999;
Bornberg-Bauer and Chan, 1999; Tiana et al., 2000).

In a recent study of a hydrophobic/polar off-lattice model, it was found that good
folders tend to show negative hydrophobicity correlations along the chains (Irback
et al., 1997). The analogous calculations gave, moreover, qualitatively similar results
for a major class of real proteins, corresponding to typical total hydrophobicities
(Irback et al., 1996). On the other hand, the opposite behavior, positive hydrophobicity
correlations, has been reported for a class of designed model sequences that
display certain protein-like features (Khokhlov and Khalatur, 1998, 1999). These
designed sequences are, for instance, not meant to have unique native states, so the
different results do not represent a contradiction. They do show, however, that
sequence correlations in proteins are a delicate issue that requires careful analysis.

The main goal of this paper is to test the robustness of the conclusion that good 
folding model sequences as well as functional proteins show negative hydrophobicity 
correlations. To this end we perform new calculations for both model and real se- 
quences. The model we study is the minimal HP model on the square lattice (Lau 
and Dill, 1989; Dill et al, 1995). This choice makes it possible for us to improve 
significantly on the statistics in the previous study (Irback et al, 1997), which was 
based on an off-lattice model. The real sequences studied are single-domain enzymes
taken from the CATH protein structure classification database (Orengo et al., 1997),
which we hope display statistical properties representative of functional (globular)
folding units. With this restriction on protein type, it turns out that the previous,
somewhat artificial, restriction on total hydrophobicity (Irback et al, 1996) can be 
lifted. 
2 Methods



2.1 Sequences 

Let us first define the sequences studied. The real sequences studied are the 173
nonhomologous single-domain enzymes found in the October 1998 release of the
CATH database (Orengo et al., 1997). These sequences are transformed into binary
hydrophobicity strings by taking the six amino acids Leu, Ile, Val, Phe, Met, and
Trp as hydrophobic ($\sigma_i = 1$) and the others as hydrophilic ($\sigma_i = -1$). This choice
is somewhat arbitrary. Therefore, we also tried a 20-valued hydrophobicity scale,
which did not affect any of the conclusions below. In CATH, the most general level
of classification is denoted "class" and describes the relative content of α helices and β
sheets. Below, the class dependence of our results is checked by separate calculations
for each of the three major classes: mainly α, mainly β, and αβ. A fourth class,
low secondary structure content, exists but is not considered separately, as only 3
of the 173 sequences belong to it. In our calculations we also divide the sequences
into extracellular and intracellular ones. Following Martin et al. (1998), we take the
presence of a disulphide bridge as an indicator of extracellular location. The number
of enzymes in the different subsets studied can be found in Table 3 below.
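The binary mapping just described can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis; the function name `to_binary_hydrophobicity` and the use of one-letter amino acid codes as input are assumptions made here.

```python
# Hydrophobic residues as listed in the text: Leu, Ile, Val, Phe, Met, Trp.
# The one-letter codes (L, I, V, F, M, W) as input format are an assumption.
HYDROPHOBIC = frozenset("LIVFMW")


def to_binary_hydrophobicity(sequence):
    """Map a one-letter amino acid string to a list of sigma_i values (+1 or -1)."""
    return [1 if residue in HYDROPHOBIC else -1 for residue in sequence.upper()]
```

For example, `to_binary_hydrophobicity("MKVL")` gives `[1, -1, 1, 1]`, since only Lys is outside the hydrophobic set.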

The model we use is the minimal two-dimensional HP model (Lau and Dill, 1989),
whose behavior is known in quite some detail (Dill et al., 1995). It contains only two
types of amino acids, H (hydrophobic, $\sigma_i = 1$) and P (polar, $\sigma_i = -1$), and the chain
conformation is represented as a self-avoiding walk on a lattice. The formation of a
hydrophobic core is favored by defining the energy as minus the number of HH pairs
that are nearest neighbors on the lattice but not along the chain. On the square
lattice, it turns out that this simple choice of energy function is sufficient
to get a significant number of sequences with unique ground states (Chan and Dill,
1994; Irback and Sandelin, 1998); complete enumeration of all possible sequences
and structures shows that the fraction of such sequences is roughly 2% for $N \le 18$.
Throughout this paper we consider all HP sequences that have unique ground states
as good folding sequences. A further central property of good folders is the ability
to fold fast into the native state, a requirement that we ignore. This is a reasonable
simplification because the sequences are short and because almost all have the same
energy gap between ground state and next lowest level.
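The energy function and the notion of a unique ground state can be made concrete with a small Python sketch for short chains. It is a brute-force illustration under stated assumptions, not the enumeration code behind the results: the names `hp_energy`, `self_avoiding_walks` and `ground_state` are hypothetical, sequences are given as strings of "H" and "P", and conformations related by lattice rotations or reflections are counted once.

```python
STEPS = ((1, 0), (0, 1), (-1, 0), (0, -1))


def hp_energy(sequence, coords):
    """Minus the number of HH pairs adjacent on the lattice but not along the chain."""
    position = {xy: i for i, xy in enumerate(coords)}
    if len(position) != len(sequence):
        raise ValueError("need one distinct lattice site per monomer")
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that every lattice bond is visited once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = position.get(neighbor)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts


def self_avoiding_walks(n):
    """Yield all n-site self-avoiding walks (n >= 2), one per symmetry class.

    Rotations are removed by fixing the first step to (1, 0); mirror images
    are removed by requiring the first step that leaves the x axis to go up.
    """
    walk = [(0, 0), (1, 0)]
    occupied = set(walk)

    def extend(turned):
        if len(walk) == n:
            yield tuple(walk)
            return
        x, y = walk[-1]
        for dx, dy in STEPS:
            if not turned and dy == -1:
                continue
            site = (x + dx, y + dy)
            if site in occupied:
                continue
            walk.append(site)
            occupied.add(site)
            yield from extend(turned or dy != 0)
            walk.pop()
            occupied.remove(site)

    yield from extend(False)


def ground_state(sequence):
    """Return (lowest energy, number of distinct conformations attaining it)."""
    best, degeneracy = None, 0
    for coords in self_avoiding_walks(len(sequence)):
        energy = hp_energy(sequence, coords)
        if best is None or energy < best:
            best, degeneracy = energy, 1
        elif energy == best:
            degeneracy += 1
    return best, degeneracy
```

In this sketch a sequence counts as a good folder when the returned degeneracy is 1. The cost grows exponentially with chain length, so the pure-Python version is practical only for chains much shorter than N = 18.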






2.2 Sequence Correlations 



Our statistical analysis of hydrophobicity strings can be divided into two parts. The
first part deals with the distribution of hydrophobicity along the chains: how does a
"good" sequence with length N and total hydrophobicity

$$M = \sum_{i=1}^{N} \sigma_i \tag{1}$$

differ from a typical sequence with the same N and M? This question can be
addressed by monitoring variables such as the number of hydrophobic and hydrophilic
clumps along the chain (White and Jacobs, 1990), Fourier amplitudes (Irback et al.,
1996), or random walk (Brownian bridge) representations (Pande et al., 1994). In
this paper we work with block variables, a widely used technique that has proven
useful in studies of DNA sequences (Peng et al., 1992) as well as proteins (Irback
et al., 1996).

In addition to the distribution of hydrophobicity along the chains, we also study the
distribution of the total hydrophobicity M. This analysis relies entirely on comparisons
between observed sequences, which makes it statistically more difficult, especially
for the real sequences with varying N.



2.2.1 The Blocking Method

In this method, for a given size s, the sequence is divided into blocks each consisting
of s consecutive $\sigma_i$ along the chain. The block variable $\sigma_k^{(s)}$ is then defined as the sum
of the s $\sigma_i$ values in block k (k = 1, ..., N/s). A useful quantity is the mean-square
fluctuation

$$\psi^{(s)} = \frac{1}{K}\,\frac{s}{N}\sum_{k=1}^{N/s}\left(\sigma_k^{(s)} - \frac{sM}{N}\right)^2 , \tag{2}$$

where we choose the normalization factor

$$K = \frac{(N^2 - M^2)(1 - s/N)}{N(N-1)} \tag{3}$$

such that, for random sequences with fixed N and M, the average of $\psi^{(s)}$ is

$$\langle \psi^{(s)} \rangle = s , \tag{4}$$

independent of N and M.
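A minimal Python sketch of the block analysis is given here for illustration. It assumes the mean-square fluctuation is $(s/N)\sum_k(\sigma_k^{(s)} - sM/N)^2$ divided by a normalization $K = (N^2 - M^2)(1 - s/N)/(N(N-1))$, which is the value that makes the average over random sequences with fixed N and M equal to s. The function name `block_fluctuation` and the restriction to block sizes that divide N are choices made for this sketch.

```python
def block_fluctuation(sigma, s):
    """Normalized mean-square block fluctuation of a +1/-1 sequence for block size s."""
    n = len(sigma)
    if s <= 0 or n % s != 0:
        raise ValueError("this sketch requires the block size to divide the length")
    m = sum(sigma)
    if s == n or abs(m) == n:
        raise ValueError("normalization vanishes for s = N or |M| = N")
    # Block variables: sum of sigma_i over s consecutive positions.
    blocks = [sum(sigma[start:start + s]) for start in range(0, n, s)]
    # Normalization chosen so that random sequences with this N and M average to s.
    k_norm = (n * n - m * m) * (1 - s / n) / (n * (n - 1))
    raw = (s / n) * sum((b - s * m / n) ** 2 for b in blocks)
    return raw / k_norm
```

Averaging `block_fluctuation` over all rearrangements of a fixed composition reproduces the random-sequence value s, whereas a perfectly alternating sequence such as `[1, -1, 1, -1]` gives 0 for s = 2, the strongest possible suppression.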
2.2.2 The Distribution of Total Hydrophobicity



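We study the M distribution for different fixed N, focusing on the mean $\langle M\rangle_N$ (the
subscript indicates fixed N) and the normalized variance

$$\chi = \frac{1}{N}\left\langle (M - \langle M\rangle_N)^2 \right\rangle_N . \tag{5}$$

It is easily verified that

$$\chi = \frac{4}{N}\sum_{i=1}^{N} h_i(1 - h_i) + \frac{1}{N}\sum_{i\neq j} c_{ij} , \tag{6}$$

where $h_i = (1 + \langle\sigma_i\rangle_N)/2$ denotes the fraction of sequences that have $\sigma_i = 1$, and
$c_{ij} = \langle\sigma_i\sigma_j\rangle_N - \langle\sigma_i\rangle_N\langle\sigma_j\rangle_N$ is the $\sigma_i$, $\sigma_j$ correlation. So, if the $\sigma_i$ values are
uncorrelated, then

$$\chi = \chi_1 \equiv \frac{4}{N}\sum_{i=1}^{N} h_i(1 - h_i) , \tag{7}$$

which becomes

$$\chi = \chi_0 \equiv 4h(1 - h) \tag{8}$$

in case the hydrophobicity profile $\{h_i\}$ is flat with $h_i = h$ for all i. Below, these two
predictions are tested for the model sequences.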
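The decomposition of the normalized variance into a profile term and a correlation term can be checked numerically. The sketch below is illustrative only: the function name `chi_decomposition` and the input format (a list of equal-length lists of +1/-1) are assumptions, and the flat-profile value is evaluated at the mean of the $h_i$.

```python
def chi_decomposition(sequences):
    """Return chi, chi_0, chi_1 and the correlation term for an ensemble of fixed N."""
    count = len(sequences)
    n = len(sequences[0])
    mean = [sum(seq[i] for seq in sequences) / count for i in range(n)]
    h = [(1 + m) / 2 for m in mean]  # fraction of sequences with sigma_i = +1
    totals = [sum(seq) for seq in sequences]
    mean_total = sum(totals) / count
    # Normalized variance of the total hydrophobicity M.
    chi = sum((t - mean_total) ** 2 for t in totals) / (count * n)
    # Prediction for uncorrelated sigma_i with the observed profile.
    chi_1 = 4 / n * sum(hi * (1 - hi) for hi in h)
    # Prediction for uncorrelated sigma_i with a flat profile.
    h_flat = sum(h) / n
    chi_0 = 4 * h_flat * (1 - h_flat)
    # Correlation term: (1/N) times the sum of c_ij over all i != j.
    corr = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                pair = sum(seq[i] * seq[j] for seq in sequences) / count
                corr += pair - mean[i] * mean[j]
    return {"chi": chi, "chi_0": chi_0, "chi_1": chi_1, "corr": corr / n}
```

For any ensemble, `chi` equals `chi_1 + corr` up to rounding. For the two-sequence ensemble `[[1, -1], [-1, 1]]`, M is always zero, so `chi` is 0 while `chi_1` is 1 and the correlation term is -1.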

Unfortunately, our set of enzymes cannot be analyzed this way, due to limited statistics.
However, as we will see, it turns out that the data for the mean $\langle M\rangle_N$ can be
approximately described by a simple linear relation, $\langle M\rangle_N \approx \bar{M} = (2h - 1)N$. As an
effective measure of the fluctuations in M, we therefore consider

$$\tilde{\chi} = \left\langle \frac{(M - \bar{M})^2}{N} \right\rangle , \tag{9}$$

where the average now is over all sequences, irrespective of N. If the $\sigma_i$ values, for
each N, were uncorrelated with identical $h_i = h$, then we would have

$$\tilde{\chi} = \chi_0 = 4h(1 - h) . \tag{10}$$



Let us finally stress that $\psi^{(s)}$ and $\chi$ are fundamentally different measurements. In
the blocking method individual sequences are compared to random sequences with
the same N and M. Hence, $\psi^{(s)}$ provides direct information on the distribution
of $\sigma_i = \pm 1$ along the chains. This is not true for $\chi$ and the correlation $c_{ij}$. This
correlation is not necessarily physical. The behavior of the analogue of $c_{ij}$ in the
ordered phase of an Ising magnet provides an illustration of this. In this case, $c_{ij}$
does not vanish at large distance, although the physical correlation length is finite.
2.3 Individual Structures



As mentioned in the introduction, several recent model studies have addressed the 
question of how sequences that fold to the same native state are related. In particular, 
using an HP-like model with compact structures only, Li et al. (1996) found that 
structure-preserving mutations tend to be largely independent for highly designable 
structures. To see whether this behavior is consistent with our analysis, we perform 
two measurements for different fixed structures, too. 

Consider a given structure r, and let $\{h_i^{(r)}\}$ be the corresponding hydrophobicity
profile ($h_i^{(r)}$ is the probability that $\sigma_i = 1$). The first quantity we calculate is

$$\Delta\chi^{(r)} = \chi^{(r)} - \frac{4}{N}\sum_{i=1}^{N} h_i^{(r)}\left(1 - h_i^{(r)}\right) , \tag{11}$$

where $\chi^{(r)}$ is defined as $\chi$ in Eq. 5 but for fixed structure. $\Delta\chi^{(r)}$ measures the average
$\sigma_i$, $\sigma_j$ correlation for fixed structure (see Eq. 6). The second quantity is the entropy

$$S = -\sum_{i=1}^{N}\left[h_i^{(r)} \ln h_i^{(r)} + \left(1 - h_i^{(r)}\right)\ln\left(1 - h_i^{(r)}\right)\right] \tag{12}$$

for a system of independent $\sigma_i$ with hydrophobicity profile $\{h_i^{(r)}\}$. If the $\sigma_i$ values are
approximately independent, then $e^S$ provides an order-of-magnitude estimate of the
actual number of sequences, $N_r$. If this is not the case, then $e^S$ overestimates $N_r$.
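The two per-structure quantities can be computed from the list of sequences that fold to a given structure. The following Python sketch is a hypothetical helper, not the authors' implementation; the name `structure_statistics` and the input format are assumptions, and the convention 0 ln 0 = 0 is used for fully conserved positions.

```python
import math


def structure_statistics(sequences):
    """Return (delta_chi, entropy) for the +1/-1 sequences sharing one structure."""
    count = len(sequences)
    n = len(sequences[0])
    # Hydrophobicity profile: fraction of sequences with sigma_i = +1.
    h = [sum(1 for seq in sequences if seq[i] == 1) / count for i in range(n)]
    totals = [sum(seq) for seq in sequences]
    mean_total = sum(totals) / count
    chi_r = sum((t - mean_total) ** 2 for t in totals) / (count * n)
    # Subtract the value expected for independent sigma_i with profile h.
    delta_chi = chi_r - 4 / n * sum(hi * (1 - hi) for hi in h)
    # Entropy of independent sigma_i with profile h (0 ln 0 taken as 0).
    entropy = 0.0
    for hi in h:
        for p in (hi, 1 - hi):
            if p > 0:
                entropy -= p * math.log(p)
    return delta_chi, entropy
```

If all four sequences of length 2 fold to the structure, the positions are independent: `delta_chi` is 0 and `exp(entropy)` equals the number of sequences, 4. For the pair `[[1, -1], [-1, 1]]` the positions are perfectly anticorrelated: `delta_chi` is -1 and `exp(entropy)` is again 4, an overestimate of the two sequences actually present.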



3 Results 



In this section we present the results of our analyses of the mean-square block
fluctuations $\psi^{(s)}$ and the distribution of total hydrophobicity, M, for model and real
sequences. We end the section with some comments on our model results and related
studies of similar models.



3.1 The Blocking Method 
  h_1     h_2     h_3     h_4     h_5     h_6     h_7     h_8     h_9
 0.794   0.642   0.467   0.456   0.553   0.498   0.526   0.479   0.523

Table 1: Hydrophobicity profile $\{h_i\}$ for good N = 18 sequences in the HP model.
By symmetry, $h_i = h_{19-i}$.



Figure 1: The mean-square block fluctuation $\psi^{(s)}$ against block size s for good N = 18
sequences in the HP model. Shown are results both for the full sequences (+) and
for the subsequences consisting of the central 14 amino acids (×). The straight line
represents random sequences; see Eq. 4.



3.1.1 Model Sequences 



In our block variable analysis of HP sequences, we consider the 6349 N = 18 sequences
that have unique native states, which can be obtained by exhaustive enumeration
(Chan and Dill, 1994). The results are compared to expected values for
random sequences, as described in Sec. 2.2.1. This comparison makes sense only if
the hydrophobicity profile $\{h_i\}$ is uniform. From Table 1 it can be seen that $h_i$ is
approximately constant in the middle part but increases towards the ends. As a check,
we therefore calculate the mean-square block fluctuation $\psi^{(s)}$ in two ways for each
sequence: first, for the full sequence; and second, after elimination of two amino
acids at each end. Figure 1 shows the results of both these calculations. We see that
the average $\psi^{(s)}$ is smaller than for random sequences, irrespective of whether the
endpoints are included or not. The conclusion that $\psi^{(s)}$, on average, is suppressed
for good sequences is in perfect agreement with earlier results for a different model
(Irback et al., 1996, 1997).
Figure 2: (a) Hydrophobicity profile $h(\xi)$ for the enzymes. The horizontal line indicates
the mean $h \approx 0.29$. (b) $\psi^{(s)}$ as a function of $\xi$ for the enzymes. The horizontal
line represents random sequences.



3.1.2 Enzymes 



We now repeat essentially the same analysis for the enzymes. The only difference is
that, because N is not fixed, the hydrophobicity profile $h(\xi)$ is taken to be a function
of the relative position $\xi$ along the chains. To calculate $h(\xi)$, we divide the interval in
$\xi$ from 0 (N end) to 1 (C end) into 100 bins. The results obtained are shown in Fig. 2a.
We see that $h(\xi)$ is approximately constant throughout the interval $0 < \xi < 1$.
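The binned profile for chains of varying length can be sketched as follows. The text does not say how a residue index is converted to a relative position, so the midpoint convention (i + 1/2)/N used here is an assumption, as is the function name `hydrophobicity_profile`.

```python
def hydrophobicity_profile(sequences, bins=100):
    """Fraction of hydrophobic residues per bin of relative chain position.

    Each sequence is a list of +1/-1 values ordered from the N end to the C end.
    Residue i (0-based) of a chain of length n is assigned the relative
    position (i + 0.5) / n, which is an assumed convention.
    """
    hydrophobic = [0] * bins
    total = [0] * bins
    for sigma in sequences:
        n = len(sigma)
        for i, value in enumerate(sigma):
            b = min(int((i + 0.5) / n * bins), bins - 1)
            total[b] += 1
            if value == 1:
                hydrophobic[b] += 1
    # Bins that receive no residues are reported as None.
    return [hyd / tot if tot else None for hyd, tot in zip(hydrophobic, total)]
```

With two bins, the sequences `[1, -1]` and `[1, 1, -1, -1]` give the profile `[1.0, 0.0]`, since every residue in the first half of either chain is hydrophobic and every residue in the second half is not.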

In an earlier block analysis of functional protein sequences (Irback et al., 1996), in
which there was no restriction on protein type, the ends were found to display a
different behavior than the rest of the sequences, and therefore they were removed
from the analysis. To check if this is true for the present data set, we calculate the
average of $\psi^{(s)}$ (see Eq. 2) as a function of $\xi$, using 25 bins in $\xi$. The results are
shown in Fig. 2b. Although the uncertainties are somewhat large, there is no sign of
the ends behaving differently.

Given these two findings, we calculate the block fluctuations using the full sequences, 
without any elimination of amino acids at the ends. 

In Fig. 3 we show the average $\psi^{(s)}$ against block size s for the 173 enzymes. Also shown
are the results obtained for five different subsets of these sequences (see Sec. 2.1). We
see that the results are similar in the different cases, and that $\psi^{(s)}$ is smaller than for
random sequences. Qualitatively, the behavior is similar to that found for the model
sequences.
Figure 3: The mean-square block fluctuation $\psi^{(s)}$ against block size s for different
groups of enzymes. (a) All sequences (data points connected by dashed line) and
intracellular (IC)/extracellular (EC) sequences. (b) Division of the sequences into
three structural classes: mainly α, mainly β, and αβ. The straight lines represent
random sequences.

In this analysis we have chosen to focus on $\psi^{(s)}$. Similar deviations from randomness
are expected in other quantities such as the number of hydrophobic/hydrophilic
clumps along the chain. The number of clumps tends to be large when $\psi^{(s)}$ is small
(Irback et al., 1997).



3.2 The Distribution of Total Hydrophobicity 
3.2.1 Model Sequences 

We now turn to the distribution of the total hydrophobicity M. Table 2 shows
$h = (1 + \langle M\rangle_N/N)/2$ and the normalized variance $\chi$ (see Eq. 5) for good HP sequences
for N = 12, ..., 18. Also shown in this table are the two predictions $\chi_0$ and $\chi_1$ defined
in Sec. 2.2.2, and a prediction $\chi_2$ that will be explained below. Note that h depends
quite weakly on N. This implies that the fraction of hydrophobic amino acids, unlike
the core to surface ratio of compact chains, does not increase with N. Of course, it
would be interesting to see whether this trend persists for much larger N.

From Table 2 we see that $\chi$ is smaller than $\chi_0$, which implies that the $\sigma_i$ values are
not both uncorrelated and uniformly distributed. Comparing to $\chi_1$ shows that the
major part of this difference is due to correlations rather than non-uniformity. The
fact that $\chi < \chi_1$ means that the average $c_{ij}$ ($i \neq j$) is negative.






  N      h      $\chi$    $\chi_0$   $\chi_1$   $\chi_2$
 12    0.527    0.577    0.997    0.913    0.589
 13    0.507    0.550    1.000    0.937    0.553
 14    0.519    0.684    0.999    0.924    0.688
 15    0.556    0.594    0.987    0.959    0.593
 16    0.542    0.687    0.993    0.936    0.663
 17    0.555    0.695    0.988    0.961    0.639
 18    0.548    0.718    0.991    0.949    0.646

Table 2: $h = (1 + \langle M\rangle_N/N)/2$ and the normalized variance $\chi$ of M for good HP
sequences for different N. Also shown are the three predictions $\chi_0$ (see Eq. 8), $\chi_1$
(Eq. 7) and $\chi_2$ (see Sec. 3.3).

The two measurements h and $\chi$ are, of course, not enough to fully characterize the
distribution of good sequences. To get an idea of how much information they provide,
we may compare to the one-dimensional Ising distribution

$$P(\sigma) \propto \exp\Big(K_1 \sum_i \sigma_i\sigma_{i+1} + K_2 \sum_i \sigma_i\Big) . \tag{13}$$

The measured values of h and $\chi$ for good N = 18 sequences can be reproduced
by choosing $K_1 \approx -0.16$ and $K_2 \approx 0.13$. For these parameters it turns out that
$e^S \approx 1.9 \times 10^5$, S being the entropy, which means that the effective number of
sequences contained in $P(\sigma)$ is considerably larger than the number of good N = 18
sequences, 6349.
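For short chains, the properties of such an Ising distribution can be evaluated exactly by summing over all $2^N$ sequences. The Python sketch below assumes a nearest-neighbor coupling $K_1$, a field $K_2$ and an open chain (no bond between the last and first site); the boundary condition and the function name `ising_statistics` are assumptions of this illustration.

```python
import math
from itertools import product


def ising_statistics(n, k1, k2):
    """Exact h, chi and entropy S for P(sigma) ~ exp(k1*sum s_i*s_{i+1} + k2*sum s_i).

    Open boundary conditions are assumed. The cost grows as 2**n.
    """
    z = 0.0          # partition function
    sum_m = 0.0      # weighted sum of M
    sum_m2 = 0.0     # weighted sum of M**2
    sum_wlogw = 0.0  # weighted sum of log-weights, used for the entropy
    for sigma in product((1, -1), repeat=n):
        m = sum(sigma)
        bonds = sum(sigma[i] * sigma[i + 1] for i in range(n - 1))
        log_w = k1 * bonds + k2 * m
        w = math.exp(log_w)
        z += w
        sum_m += w * m
        sum_m2 += w * m * m
        sum_wlogw += w * log_w
    mean_m = sum_m / z
    h = (1 + mean_m / n) / 2
    chi = (sum_m2 / z - mean_m ** 2) / n
    # S = -sum P ln P with P = w / z.
    entropy = math.log(z) - sum_wlogw / z
    return h, chi, entropy
```

Calling `ising_statistics(18, -0.16, 0.13)` enumerates 262144 sequences, which takes a few seconds in pure Python; the returned h, $\chi$ and $e^S$ can then be set against the values quoted in the text, keeping in mind that the boundary condition used here is an assumption.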



3.2.2 Enzymes 

To study the N dependence of the total hydrophobicity M for the enzymes, we
divide the data set into groups corresponding to different intervals in N. Figure 4
shows the average M for these groups against N. We see that the N dependence
is approximately linear. Although the uncertainties are difficult to estimate, it is
interesting to note that the behavior is in perfect agreement with the model results.
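A linear relation between M and N corresponds to a length-independent hydrophobic fraction h. As a rough sketch, h can be estimated by a least-squares fit of a line through the origin, $M = (2h - 1)N$. Whether the fit in the text was constrained to pass through the origin, and whether it used the binned averages or the individual sequences, is not stated, so both are assumptions here, as is the function name `fit_hydrophobic_fraction`.

```python
def fit_hydrophobic_fraction(lengths, totals):
    """Least-squares estimate of h from M = (2h - 1) N, line forced through the origin."""
    if len(lengths) != len(totals) or not lengths:
        raise ValueError("need matching, non-empty lists of N and M values")
    # Slope of the best-fit line through the origin: sum(N*M) / sum(N*N).
    slope = sum(n * m for n, m in zip(lengths, totals)) / sum(n * n for n in lengths)
    return (1 + slope) / 2
```

For data lying exactly on a line of slope -0.42, for instance N = 100, 200 with M = -42, -84, the function returns h = 0.29.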

Next we calculate $\tilde{\chi}$ in Eq. 9 using $\bar{M} = N(2h - 1)$ and h = 0.29, as obtained
from a fit to the data in Fig. 4. Table 3 shows $\tilde{\chi}$ for all sequences and for the
different subgroups described in Sec. 2.1. We see that $\tilde{\chi}$ for all sequences is larger
than predicted by Eq. 10, which contrasts sharply with the model results above. We
also note that there seems to be a strong dependence on group. In particular, there
appears to be a big difference between intra- and extracellular enzymes. However, it
must be stressed that the uncertainties are large. Improved statistics are definitely
needed in order to draw any firm conclusion about the different groups and possible
deviations from the model results.

Figure 4: Total hydrophobicity M against N for the enzymes. The data points are
averages over intervals of length 30 in N. The straight line is a least-square fit.

Type of chain     No. sequences     $\tilde{\chi}$     $\chi_0$
All chains             173          1.50 ± 0.27       0.82
Intracellular          127          0.82 ± 0.13       0.83
Extracellular           46          2.92 ± 1.15       0.78
Mainly α                23          1.45 ± 0.25       0.81
Mainly β                39          1.63 ± 0.34       0.77
αβ                     108          0.85 ± 0.14       0.83

Table 3: Analysis of the fluctuations in M for the enzymes. The quantities $\tilde{\chi}$ and $\chi_0$
are defined by Eqs. 9 and 10, respectively.



3.3 Comments 

Our study of HP sequences has been focused on structure-independent properties.
The question of how sequences that share the same (unique) native structure are
related has recently been examined using similar models (Li et al., 1996; Bornberg-Bauer,
1997; Bornberg-Bauer and Chan, 1999). From these studies, a simple picture
seems to emerge for structures that are highly designable. For high-$N_r$ structures ($N_r$
is the number of sequences that fold to the structure r), it has been found that the
sequences tend to form a single cluster connected by one-point mutations, called a
"neutral net" (Bornberg-Bauer, 1997), and that structure-preserving mutations tend
to be largely independent (Li et al., 1996). The latter property was observed in a
model with compact structures only. We checked that it holds in the present model
too, which is illustrated in Fig. 5. From this figure it can be seen that the quantities
$e^S/N_r$ and $|\Delta\chi^{(r)}|$, as defined in Sec. 2.3, indeed tend to be small for high $N_r$. Also
indicated in this figure is whether or not the sequences form a neutral net, results
first obtained by Bornberg-Bauer (1997).

Figure 5: (a) ($e^S/N_r$, $N_r$) and (b) ($\Delta\chi^{(r)}$, $N_r$) scatter plots for the 1475 designable
N = 18 structures in the HP model. The shape of the plot symbol indicates whether
the sequences form a neutral net (+) or not (o).



The fact that structure-preserving mutations are largely independent for high $N_r$
does not contradict our previous results. To verify this, we calculated $\chi$ from the
known hydrophobicity profiles $\{h_i^{(r)}\}$ under the assumption that the $\sigma_i$ values are
independent for each structure. The value obtained this way, $\chi_2$, can be found in
Table 2 above, and is indeed a relatively good approximation to the observed $\chi$.
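One way to carry out such a calculation is to treat the ensemble of good sequences as a mixture over structures, with independent positions inside each structure. The variance of M is then the weighted within-structure variance plus the spread of the structure means. The Python sketch below follows this reading; weighting each structure by its number of sequences $N_r$ is an assumption of the sketch, and the function name `chi_from_profiles` is hypothetical.

```python
def chi_from_profiles(profiles, counts):
    """Normalized variance of M for a mixture of structures with independent sites.

    profiles[r][i] is the probability that sigma_i = +1 for structure r, and
    counts[r] is the number of sequences folding to r (used as mixture weight).
    """
    n = len(profiles[0])
    total = sum(counts)
    # Mean and variance of M within each structure, assuming independent sites.
    means = [sum(2 * h - 1 for h in profile) for profile in profiles]
    variances = [sum(4 * h * (1 - h) for h in profile) for profile in profiles]
    grand_mean = sum(c * m for c, m in zip(counts, means)) / total
    # Law of total variance: within-structure part plus between-structure part.
    variance = sum(
        c * (v + (m - grand_mean) ** 2)
        for c, v, m in zip(counts, variances, means)
    ) / total
    return variance / n
```

As a check, two equally weighted structures with the fully conserved profiles `[1, 1]` and `[0, 0]` correspond to the two sequences HH and PP, for which M is +2 or -2 and the normalized variance is 2.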

Admittedly, the model used in this study is crude. In particular, Buchler and Goldstein
(1999, 2000) have recently argued, based on a study of compact lattice chains,
that the use of a two-letter alphabet leads to designability artifacts, which disappear
with increasing alphabet size. Let us stress, therefore, that the analyses discussed in
this paper can be tested on real proteins in a direct manner. Let us also comment
on the stability of our results. First, we note that the dependence on chain length
N is weak. This was explicitly shown for $\chi$, and is true for $\psi^{(s)}$ too, although our
discussion focused on one system size in this case. Second, we note that our results
are in nice agreement with those obtained earlier using a simple hydrophobic/polar
off-lattice model (Irback et al., 1997). To further explore the model dependence of
our results, we also did calculations for a "solvation-like" two-letter model discussed
by Ejtehadi et al. (1998a,b) and by Buchler and Goldstein (1999, 2000). This model
differs from the HP model in that the interaction strength is additive [e(H,H) = -2e,
e(H,P) = -e and e(P,P) = 0], which means that the total energy can be expressed as
a simple sum of monomer contributions. Buchler and Goldstein argued that HP-like
models, unlike pair-contact models with larger alphabets, tend to have solvation-like
designability properties. It is therefore interesting to note that when analyzing
sequences with unique ground states in the solvation-like model defined above, we
obtained results qualitatively different from those for the HP model. More precisely,
it turns out that the block fluctuations are significantly larger, close to random, for
the solvation-like model.



4 Summary and Discussion 



Hydrophobicity plays a key role in the formation of protein structures, which makes it 
of utmost interest to understand the statistical distribution of hydrophobicity along 
the chains. In this paper we have analyzed hydrophobic/polar sequences in the 
two-dimensional HP lattice model. Whenever statistically feasible, the analogous 
calculations were performed for a set of real enzymes, too. Our main findings are as 
follows. 

• Both model sequences and enzymes show mean-square block fluctuations $\psi^{(s)}$
that are smaller than for random sequences. In particular, this implies that the
enzymes display the same behavior that had been found previously for general
proteins with typical total hydrophobicities (Irback et al., 1996). The present
analysis was performed without any restriction on total hydrophobicity.

• The average total hydrophobicity M varies approximately linearly with chain 
length N over the range of N studied, both for model sequences and enzymes. 
This implies, contrary to what one naively might expect, that the fraction of 
hydrophobic amino acids does not grow with increasing N. The fluctuations in 
M are difficult to study for the enzymes, due to statistical uncertainties. For 
the model sequences it turns out that the normalized variance $\chi$ is significantly
smaller than for random sequences. 

We also divided the enzymes into different groups according to their structural content,
and to whether they reside in an intra- or extracellular environment. The
fluctuations in total hydrophobicity appeared to depend on group. However, whether
this dependence is significant or not is difficult to say, due to statistical uncertainties.
The mean-square block fluctuations are statistically much easier to measure, and
show only a weak dependence on group. The conclusion that $\psi^{(s)}$ is suppressed is, in
particular, the same for all the different groups.

A full explanation of the suppression of $\psi^{(s)}$ is probably hard to give. Let us note,
however, that long hydrophobic or hydrophilic stretches in the amino acid sequence
are likely to lead to degenerate structures, and the suppression of sequences containing
such stretches should indeed tend to make $\psi^{(s)}$ smaller.

The nonrandomness of the block fluctuations provides an indirect confirmation of 
the important role played by hydrophobicity in the formation of protein structures. 
Furthermore, it is tempting to take the similarity with the model results as an in- 
dication that the ability to form a stable structure represents a significant selective 
advantage in the evolution of proteins. It would be interesting to check that the 
behavior remains the same in more realistic models. 